Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis

Authors

William R. Hersh

Emily M. Campbell

Susan Malveau

Abstract

OBJECTIVE: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing.

METHODS: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness.

RESULTS: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings.

CONCLUSIONS: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.
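The word-lookup step described in the abstract's methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tokenization rule, the function names `tokenize` and `lexical_coverage`, the choice to measure coverage over distinct words, and the tiny example word lists are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenization rule: runs of ASCII letters, lowercased.
# The abstract does not say how words were delimited.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Break text into lowercase word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def lexical_coverage(text, word_lists):
    """Return (recognized fraction of distinct words, sorted unrecognized words).

    `word_lists` maps a source name (for example a vocabulary) to a set of
    lowercase words; a word counts as recognized if any list contains it.
    Counting distinct words rather than occurrences is an assumption.
    """
    counts = Counter(tokenize(text))
    known = set().union(*word_lists.values())
    unrecognized = sorted(w for w in counts if w not in known)
    if not counts:
        return 0.0, unrecognized
    recognized = len(counts) - len(unrecognized)
    return recognized / len(counts), unrecognized


# Illustrative stand-ins for the vocabularies, common word list, and names database.
example_lists = {
    "medical_vocabulary": {"hypertension", "patient", "aspirin"},
    "common_words": {"the", "has", "and", "takes"},
    "names": {"smith"},
}
note = "The patient Smith has hypertenson and takes aspirin."
fraction, unknown = lexical_coverage(note, example_lists)
```

In this toy note, the misspelling "hypertenson" is the only word left unrecognized, which is the kind of residue the abstract says was then examined for spelling correctness.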

Related resources

Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities

This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...

A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles

There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...

Extracting Concepts Related to Homelessness from the Free Text of VA Electronic Medical Records

Mining the free text of electronic medical records (EMR) using natural language processing (NLP) is an effective method of extracting information not always captured in administrative data. We sought to determine if concepts related to homelessness, a non-medical condition, were amenable to extraction from the EMR of Veterans Affairs (VA) medical records. As there were no off-the-shelf products...

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical written string that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...

First Language Activation during Second Language Lexical Processing in a Sentential Context

Lexicalization patterns, the way words are mapped onto concepts, differ from one language to another. This study investigated the influence of first language (L1) lexicalization patterns on the processing of second language (L2) words in sentential contexts by both less proficient and more proficient Persian learners of English. The focus was on cases where two different senses of a polys...

Journal title:

Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium

Publication date: 1997
